This page demonstrates a posibble workflow when dealing with prealigned sequences that we do not wish to realign, possibly because the method by which they were aligned is not availbale in ReproPhylo. In this case we wil analyse a SILVA alignment, which takes the secondary structure of the aligned sequences into consideration.
from reprophylo import *
The alignment consists of 18S rRNA sequences. We will start a project and define the the locus we are analysing:
pj = Project([Locus('dna','rRNA','18s',['18s'])])
print pj.loci[0]
In order to read an alignment and place it as such in the Project, we can use the Project method read_alignment.
Note: If we wanted to realign the sequences, we would then have read the same file with the read_denovo method, which would have reset the gaps and would place the data as sequence records instead of as a sequence alignment.
# Raising an error because of a mistake in the locus name
pj.read_alignment('data/arb-silva.de_2014-11-17_id215613_gapcompressedvertical.fasta','dna','rRNA','missing_locus')
# Raising an error because of an error in the feature type
pj.read_alignment('data/arb-silva.de_2014-11-17_id215613_gapcompressedvertical.fasta','dna','wrong_feature','18s')
# Raising an error because of an error in the character type
pj.read_alignment('data/arb-silva.de_2014-11-17_id215613_gapcompressedvertical.fasta','prot','rRNA','18s')
# Finaly getting it right
pj.read_alignment('data/arb-silva.de_2014-11-17_id215613_gapcompressedvertical.fasta','dna','rRNA','18s')
The read alignment places the data in several places. It will place degapped sequences as records in pj.records:
pj.records[:10]
The records will have a new source feature, as is the case when we read denovo data with read_denovo. However, unlike read_denovo, read_alignment will also add a normal sequence feature, with all the qualifiers usually add by ReproPhylo, such as feature_id and GC_content. The gene specified as an argument of read_alignment will be added to the gene qualifier of this new feature:
pj.write('starting_with_aln.gb')
genbank_file = !cat starting_with_aln.gb
for line in genbank_file[:100]:
print line
and as expected, the alignment object will be added to pj.alignments:
print pj.alignments['18s@ReadDirectly']
Using the show_aln method, we can now view the sequence alignment graphically. To make it informative we'll use the original description as labels. A small slice of the screen is pesented below.
pj.show_aln('18s@ReadDirectly', id=['source_original_desc'])
from IPython.display import Image
Image('Selection_011.png', width='700')
Well, the original description is not very convinient as it is too long. We have to edit the metadata a bit, either by printing a metadata CSV file, editing it and reading it back (see Tetillidae tutorial) or using Project method. We will copy the original description from the source features to the sequence features:
pj.add_qualifier_from_source('original_desc')
Let's see if it worked:
pj.write('starting_with_aln.gb')
genbank_file = !cat starting_with_aln.gb
for line in genbank_file[:100]:
print line
Now we have the descriptions in the feature's qualifiers. Let's use actual Biopython to turn them to concise 'organism' qualifiers:
for record in pj.records:
for feature in record.features:
feature.qualifiers['organism'] = [feature.qualifiers['original_desc'][0].split(';')[-1]]
Now we'll check again if our new qualifier was added to the features:
pj.write('starting_with_aln.gb')
genbank_file = !cat starting_with_aln.gb
for line in genbank_file[:100]:
print line
OK, we can now use the 'organism' qualifiers as labels to make a reasonablly looking sequence alignment:
pj.show_aln('18s@ReadDirectly', id=['organism'])
from IPython.display import Image
Image('Selection_009.png', width='700')
Next step, alignemnt trimming (see Tetillidae tutorial):
manual = TrimalConf(pj, method_name='manual' ,trimal_commands={'gapthreshold': '0.5'})
pj.trim([manual])
Checking out the alignment will confirm the trimming took place:
pj.show_aln('18s@ReadDirectly@manual', id=['organism'])
from IPython.display import Image
Image('Selection_010.png', width='700')
And tree reconstruction (again, see more details in the Tetillidae tutorial):
raxml = RaxmlConf(pj, method_name='fD_fb', preset='fD_fb', threads=6)
pj.tree([raxml])
The resulting tree will be rooted at midpoint and annotated with node supports:
supports = {'black':[100,99],
'dimgray':[99,75],
'silver':[75,50]}
pj.annotate('.', 'mid', 'mid', ['organism'], node_support_dict=supports, scale=2000)
Rather than viewing the png file, we can show the figure inline by fetching the tree object with the Project ft method and then render it with ETE's render method:
# The tree object retains the faces but not the tree style
# We need to exclude leaf names, but not the text faces or
# the brach support faces.
# We also do not get the supports legend
tstyle = TreeStyle()
tstyle.show_leaf_name = False
pj.ft('18s').render('%%inline', w=400, tree_style=tstyle)
It seems that we have some srange looking long branches in the tree. Lets use the full description as label to see the taxonomy of our OTUs and spot contaminations that were indicated in the SILVA description:
pj.clear_tree_annotations()
pj.annotate('.', 'mid', 'mid', ['source_original_desc'], node_support_dict=supports, scale=2000)
pj.ft('18s').render('%%inline', w=800, tree_style=tstyle)
There are two branches indicated to be a contamination and another one where the branch is long and the species is not identified. We are going to print the tree again, using the organism and the original SILVA id to help us list the IDs of the sequence we want to exclude. We will be able to use this list in the read_alignment method to automatically exclude them without actually editing the input alignment file.
pj.clear_tree_annotations()
pj.annotate('.', 'mid', 'mid', ['organism','source_original_id'], node_support_dict=supports, scale=2000)
pj.ft('18s').render('%%inline', w=800, tree_style=tstyle)
The list of IDs is now passed to the exclude keyword and they will not be read into the Project:
# Start a project with an 18S locus
pj = Project([Locus('dna','rRNA','18s',['18s'])])
# Read the SILVA alignemnt while excluding three sequences
pj.read_alignment('data/arb-silva.de_2014-11-17_id215613_gapcompressedvertical.fasta',
'dna','rRNA','18s',
exclude=['ABLG01003097.925.2722',
'CABB01001290.10393.12124',
'JQ806350.1.381'])
# Edit the qualifiers as above to create an organism qualifier
pj.add_qualifier_from_source('original_desc')
for record in pj.records:
for feature in record.features:
feature.qualifiers['organism'] = [feature.qualifiers['original_desc'][0].split(';')[-1]]
# Trim the alignment
manual = TrimalConf(pj, method_name='manual' ,trimal_commands={'gapthreshold': '0.5'})
pj.trim([manual])
# Build the tree
raxml = RaxmlConf(pj, method_name='fD_fb', preset='fD_fb', threads=6)
pj.tree([raxml])
#Annotate the tree as before
pj.annotate('.', 'mid', 'mid', ['organism'], node_support_dict=supports, scale=2000)
# Checkpoint the analysis
pickle_pj(pj, 'meloidogyne.ckp')
# Archive the analysis
publish(pj, 'meloidogyne', '.')
Here is the resulting tree without the excluded sequences. We are looking at the png file and not rendering the object because the object does not retain the tree style which also includes the scaling. Therefore, in the figure, the tree is wider. It seems that there are some incognita's that have long branches and the alignment should be closely inspected.
from IPython.display import Image
Image('340991416256292.48_18s@ReadDirectly@manual.png')